Abstract

We present a novel approach for training ker- 
nel Support Vector Machines, establish learn- 
ing runtime guarantees for our method that 
are better than those of any other known 
kernelized SVM optimization approach, and 
show that our method works well in practice 
compared to existing alternatives. 



1. Introduction 

We present a novel algorithm for training kernel Sup- 
port Vector Machines (SVMs). One may view a SVM 
as the bi-criterion optimization problem of seeking a 
predictor with large margin (low norm) on the one 
hand, and small training error on the other. Our 
approach is a stochastic gradient method on a non- 
standard scalarization of this bi-criterion problem. 
In particular, we use the "slack constrained" scalarized
optimization problem introduced by Hazan et al.
(2011), where we seek to maximize the classification
margin, subject to a constraint on the total amount of
"slack", i.e. the sum of the violations of this margin. Our
approach is based on an efficient method for comput- 
ing unbiased gradient estimates on the objective. Our 
algorithm can be seen as a generalization of the "Batch 
Perceptron" to the non-separable case (i.e. when errors 
are allowed), made possible by introducing stochas- 
ticity, and we therefore refer to it as the "Stochastic 
Batch Perceptron" (SBP). 

The SBP is fundamentally different from Pegasos 
(Shalev-Shwartz et al., 2011) and other stochastic gradient
approaches to the problem of training SVMs, in
that calculating each stochastic gradient estimate still
requires considering the entire data set. In this re- 
gard, despite its stochasticity, the SBP is very much 
a "batch" rather than "online" algorithm. For a lin- 
ear SVM, each iteration would require runtime linear 
in the training set size, resulting in an unacceptable 
overall runtime. However, in the kernel setting, es- 
sentially all known approaches already require linear 
runtime per iteration. A more careful analysis reveals 
the benefits of the SBP over previous kernel SVM op- 
timization algorithms. 

In order to compare the SBP runtime to the runtime of
other SVM optimization algorithms, which typically work
on different scalarizations of the bi-criterion problem,
we follow Bottou & Bousquet (2008) and Shalev-Shwartz &
Srebro (2008) and compare the runtimes required to ensure
a generalization error of $\mathcal{L}^* + \epsilon$,
assuming the existence of some unknown predictor $u$ with
norm $\|u\|$ and expected hinge loss $\mathcal{L}^*$. The
main advantage of the SBP is in the regime in which
$\epsilon = \Omega(\mathcal{L}^*)$, i.e. we seek a constant
factor approximation to the best achievable error (e.g. we
would like an error of $1.01\mathcal{L}^*$). In this regime,
the overall SBP runtime is $\|u\|^4/\epsilon$, compared with
$\|u\|^4/\epsilon^3$ for Pegasos and $\|u\|^4/\epsilon^2$ for
the best known dual decomposition approach.

2. Setup and Formulations 

Training a SVM amounts to finding a vector $w$ defining a
classifier $x \mapsto \mathrm{sign}(\langle w, \Phi(x) \rangle)$
that on the one hand has small norm (corresponding to a
large classification margin), and on the other has a small
training error, as measured through the average hinge loss
on the training sample:
$\hat{\mathcal{L}}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i \langle w, \Phi(x_i) \rangle)$,
where each $(x_i, y_i)$ is a labeled example, and
$\ell(a) = \max(0, 1 - a)$ is the hinge loss. This is
captured by the following bi-criterion optimization problem:

$$\min_{w \in \mathbb{R}^d} \; \|w\|, \; \hat{\mathcal{L}}(w). \qquad (2.1)$$

We focus on kernelized SVMs, where the feature map
$\Phi(x)$ is specified implicitly via a kernel
$K(x, x') = \langle \Phi(x), \Phi(x') \rangle$, and assume
that $K(x, x') \le 1$. We consider only "black box" access
to the kernel (i.e. our methods work for any kernel, as
long as we can compute $K(x, x')$ efficiently), and in our
runtime analysis treat kernel evaluations as requiring
$O(1)$ runtime. Since kernel evaluations dominate the
runtime of all methods studied (ours as well as previous
methods), one can also interpret the runtimes as indicating
the number of required kernel evaluations. To simplify our
derivation, we often discuss the explicit SVM, using
$\Phi(x)$, and refer to the kernel only when needed.

A typical approach to the bi-criterion Problem 2.1 is to
scalarize it using a parameter $\lambda$ controlling the
tradeoff between the norm (inverse margin) and the
empirical error:

$$\min_{w \in \mathbb{R}^d} \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \ell(y_i \langle w, \Phi(x_i) \rangle) \qquad (2.2)$$

Different values of $\lambda$ correspond to different Pareto
optimal solutions of Problem 2.1, and the entire Pareto
front can be explored by varying $\lambda$.

We instead consider the "slack constrained" scalarization
(Hazan et al., 2011), where we maximize the "margin"
subject to a constraint of $\nu$ on the total allowed
"slack", corresponding to the average error. That is, we
aim at maximizing the margin by which all points are
correctly classified (i.e. the minimal distance between a
point and the separating hyperplane), after allowing
predictions to be corrected by a total amount specified by
the slack constraint:

$$\max_{w \in \mathbb{R}^d} \; \max_{\xi \in \mathbb{R}^n} \; \min_{i \in \{1, \dots, n\}} \; \left( y_i \langle w, \Phi(x_i) \rangle + \xi_i \right) \qquad (2.3)$$
$$\text{subject to: } \|w\| \le 1, \quad \xi \succeq 0, \quad \mathbf{1}^T \xi \le n\nu$$

In this scalarization, varying $\nu$ explores different
Pareto optimal solutions of Problem 2.1. This is captured
by the following Lemma, which also quantifies how
suboptimal solutions of the slack-constrained objective
correspond to Pareto suboptimal points:

Lemma 2.1. (Hazan et al., 2011, Lemma 2.1) For
any $u \ne 0$, consider Problem 2.3 with
$\nu = \hat{\mathcal{L}}(u) / \|u\|$. Let $\tilde{w}$ be an
$\epsilon$-suboptimal solution to this problem with
objective value $\gamma$, and consider the rescaled solution
$w = \tilde{w}/\gamma$. Then:



3. The Stochastic Batch Perceptron 

In this section, we will develop the Stochastic Batch
Perceptron. We consider Problem 2.3 as optimization
of the variable $w$ with a single constraint $\|w\| \le 1$,
with the objective being to maximize:

$$f(w) = \max_{\xi \succeq 0,\; \mathbf{1}^T \xi \le n\nu} \; \min_{p \in \Delta^n} \; \sum_{i=1}^{n} p_i \left( y_i \langle w, \Phi(x_i) \rangle + \xi_i \right) \qquad (3.1)$$

Notice that we replaced the minimization over training
indices $i$ in Problem 2.3 with an equivalent minimization
over the probability simplex,
$\Delta^n = \{p \succeq 0 : \mathbf{1}^T p = 1\}$, and that we
consider $p$ and $\xi$ to be a part of the objective, rather
than optimization variables. The objective $f(w)$ is a
concave function of $w$, and we are maximizing it over a
convex constraint $\|w\| \le 1$, and so this is a convex
optimization problem in $w$.

Our approach will be to perform a stochastic gradient
update on $w$ at each iteration: take a step in the
direction specified by an unbiased estimator of a
(super)gradient of $f(w)$, and project back to
$\|w\| \le 1$. To this end, we will need to identify the
(super)gradients of $f(w)$ and understand how to
efficiently calculate unbiased estimates of them.

3.1. Warmup: The Separable Case 

As a warmup, we first consider the separable case,
where $\nu = 0$ and no errors are allowed. The objective
is then:

$$f(w) = \min_i \; y_i \langle w, \Phi(x_i) \rangle \qquad (3.2)$$

This is simply the "margin" by which all points are
correctly classified, i.e. $\gamma$ s.t.
$\forall i \;\, y_i \langle w, \Phi(x_i) \rangle \ge \gamma$. We
seek a linear predictor $w$ with the largest possible
margin. It is easy to see that (super)gradients with
respect to $w$ are given by $y_i \Phi(x_i)$ for any index
$i$ attaining the minimum in Equation 3.2, i.e. by the
"most poorly classified" point(s). A gradient ascent
approach would then be to iteratively find such a point,
update $w \leftarrow w + \eta y_i \Phi(x_i)$, and project
back to $\|w\| \le 1$. This is akin to a "batch Perceptron"
update, which at each iteration searches for a violating
point and adds it to the predictor.
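The separable-case update just described can be sketched for an explicit (non-kernel) feature map. This is only an illustration under stated assumptions: the function name `batch_perceptron_step`, the fixed step size `eta` and the use of plain Python lists are choices made for the example, not part of the original method description.

```python
import math

def batch_perceptron_step(w, X, y, eta):
    """One projected supergradient-ascent step on f(w) = min_i y_i <w, x_i>."""
    # Margin y_i <w, x_i> of every training point under the current w.
    margins = [yi * sum(wk * xk for wk, xk in zip(w, xi))
               for xi, yi in zip(X, y)]
    # The "most poorly classified" point attains the minimum margin.
    i = min(range(len(X)), key=margins.__getitem__)
    # Supergradient step: w <- w + eta * y_i * x_i.
    w = [wk + eta * y[i] * xk for wk, xk in zip(w, X[i])]
    # Project back onto the unit ball ||w|| <= 1.
    norm = math.sqrt(sum(wk * wk for wk in w))
    scale = min(1.0, 1.0 / norm) if norm > 0 else 1.0
    return [wk * scale for wk in w]
```

Each call scans all training points once, which is why the step is "batch" even though only a single point enters the update.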

In the separable case, we could actually use exact su- 
pergradients of the objective. As we shall see, it is 
computationally beneficial in the non-separable case 
to base our steps on unbiased gradient estimates. We 
therefore refer to our method as the "Stochastic Batch 
Perceptron" (SBP), and view it as a generalization of 
the batch Perceptron which uses stochasticity and is 
applicable in the non-separable setting. In the same
way that the "batch Perceptron" can be used to maximize
the margin in the separable case, the SBP can
be used to obtain any SVM solution along the Pareto
front of the bi-criterion Problem 2.1.

Figure 1. Illustration of how one finds $\xi^*$ and $p^*$.
The upper curve represents the values of the responses
$c_i$, listed in order of increasing magnitude. The lower
curve illustrates a minimax optimal probability
distribution $p^*$.

3.2. Supergradients of $f(w)$

For a fixed $w$, we define $c \in \mathbb{R}^n$ to be the
vector of "responses":

$$c_i = y_i \langle w, \Phi(x_i) \rangle \qquad (3.3)$$

Supergradients of $f(w)$ at $w$ can be characterized
explicitly in terms of minimax-optimal pairs $p^*$ and
$\xi^*$ such that
$p^* = \arg\min_{p \in \Delta^n} p^T (c + \xi^*)$ and
$\xi^* = \arg\max_{\xi \succeq 0,\, \mathbf{1}^T \xi \le n\nu} (p^*)^T (c + \xi)$.

Lemma 3.1 (Proof in Appendix C). For any $w$, let
$p^*, \xi^*$ be minimax optimal for Equation 3.1. Then
$\sum_{i=1}^{n} p_i^* y_i \Phi(x_i)$ is a supergradient of
$f(w)$ at $w$.

This suggests a simple method for obtaining unbiased
estimates of supergradients of $f(w)$: sample a training
index $i$ with probability $p_i^*$, and take the stochastic
supergradient to be $y_i \Phi(x_i)$. The only remaining
question is how one finds a minimax optimal $p^*$.

It is possible to find a minimax optimal $p^*$ in $O(n)$
time. For any $\xi$, a solution of
$\min_{p \in \Delta^n} p^T (c + \xi)$ must put all of the
probability mass on those indices $i$ for which
$c_i + \xi_i$ is minimized. Hence, an optimal $\xi^*$ will
maximize the minimal value of $c_i + \xi_i^*$. This is
illustrated in Figure 1. The intuition is that the total
mass $n\nu$ available to $\xi$ is distributed among the
indices as if this volume of water were poured into a basin
with height $c_i$. The result is that the indices $i$ with
the lowest responses have columns of water above them such
that the common surface level of the water is $\gamma$.



Once the "water level" $\gamma$ has been determined, the
optimal $p^*$ must be uniform on those indices $i$ for
which $\xi_i^* > 0$, i.e. for which $c_i < \gamma$, must be
zero on all $i$ s.t. $c_i > \gamma$, and could take any
intermediate value when $c_i = \gamma$ (that is, for some
$q > 0$, we must have $c_i < \gamma \Rightarrow p_i^* = q$,
$c_i = \gamma \Rightarrow 0 \le p_i^* \le q$, and
$c_i > \gamma \Rightarrow p_i^* = 0$; see Figure 1). In
particular, the uniform distribution over all indices such
that $c_i < \gamma$ is minimax optimal. Notice that in the
separable case, where no slack is allowed,
$\gamma = \min_i c_i$ and any distribution supported on the
minimizing point(s) is minimax optimal, and
$y_i \Phi(x_i)$ is an exact supergradient for such an $i$,
as discussed in Section 3.1.

It is straightforward to find the water level $\gamma$ in
linear time once the responses $c_i$ are sorted (as in
Figure 1), i.e. with a total runtime of $O(n \log n)$ due
to sorting. It is also possible to find the water level
$\gamma$ in linear time, without sorting the responses,
using a divide-and-conquer algorithm, further details of
which may be found in Appendix B.
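The sorting-based route to the water level can be sketched as follows. This is a minimal illustration, assuming the water level is the value $\gamma$ at which the poured volume $\sum_i \max(0, \gamma - c_i)$ equals the budget $n\nu$; the names `water_level` and `minimax_distribution` are hypothetical, and the fallback to the minimizers when no index lies strictly below the level covers the separable case $\nu = 0$.

```python
def water_level(c, budget):
    """Return gamma such that sum_i max(0, gamma - c_i) == budget (budget >= 0)."""
    s = sorted(c)
    prefix = 0.0
    for k in range(1, len(s) + 1):
        prefix += s[k - 1]
        # Level reached if the water covers exactly the k lowest responses.
        gamma = (budget + prefix) / k
        # Valid once the level does not rise above the next response.
        if k == len(s) or gamma <= s[k]:
            return gamma

def minimax_distribution(c, nu):
    """Water level and a minimax-optimal p*: uniform on indices with c_i < gamma."""
    n = len(c)
    gamma = water_level(c, n * nu)
    support = [i for i in range(n) if c[i] < gamma]
    if not support:  # no slack available: fall back to the minimizing point(s)
        support = [i for i in range(n) if c[i] <= gamma]
    p = [0.0] * n
    for i in support:
        p[i] = 1.0 / len(support)
    return gamma, p
```

For instance, with responses `[0, 1, 5]` and a total budget of 3, the two lowest columns are filled to a common level of 2, and the third response stays dry.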

3.3. Kernelized Implementation 

In a kernelized SVM, $w$ is an element of an implicit
space, and cannot be represented explicitly. We therefore
represent $w$ as $w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i)$,
and maintain not $w$ itself, but instead the coefficients
$\alpha_i$. Our stochastic gradient estimates are always of
the form $y_i \Phi(x_i)$ for an index $i$. Taking a step in
this direction amounts to simply increasing the
corresponding $\alpha_i$.

We could calculate all the responses $c_i$ at each
iteration as
$c_i = \sum_{j=1}^{n} \alpha_j y_i y_j K(x_i, x_j)$. However,
this would require a quadratic number of kernel evaluations
per iteration. Instead, as is typically done in kernelized
SVM implementations, we keep the responses $c_i$ on hand,
and after each stochastic gradient step of the form
$w \leftarrow w + \eta y_j \Phi(x_j)$, we update the
responses as:

$$c_i \leftarrow c_i + \eta y_i y_j K(x_i, x_j) \qquad (3.4)$$

This involves only $n$ kernel evaluations per iteration.

In order to project $w$ onto the unit ball, we must either
track $\|w\|$ or calculate it from the responses as
$\|w\|^2 = \sum_{i=1}^{n} \alpha_i c_i$. Rescaling $w$ so as
to project it back into $\|w\| \le 1$ is performed by
rescaling all coefficients $\alpha_i$ and responses $c_i$,
again taking time $O(n)$ and no additional kernel
evaluations.
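The bookkeeping described in this section can be checked numerically. The sketch below is illustrative only: the Gaussian kernel, the helper names and the toy usage are assumptions made for the example, and any black-box kernel could be substituted.

```python
import math

def gaussian_kernel(a, b, sigma2=1.0):
    """Example kernel (an assumption; the method only needs black-box access)."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma2))

def responses_from_scratch(alpha, X, y, kernel):
    """c_i = sum_j alpha_j y_i y_j K(x_i, x_j): n^2 kernel evaluations."""
    n = len(X)
    return [sum(alpha[j] * y[i] * y[j] * kernel(X[i], X[j]) for j in range(n))
            for i in range(n)]

def step_on_example(alpha, c, X, y, kernel, j, eta):
    """Step w <- w + eta y_j Phi(x_j): bump alpha_j, update all responses in place."""
    alpha[j] += eta
    for i in range(len(X)):  # only n kernel evaluations
        c[i] += eta * y[i] * y[j] * kernel(X[i], X[j])

def squared_norm(alpha, c):
    """||w||^2 = sum_i alpha_i c_i, with no further kernel evaluations."""
    return sum(a * ci for a, ci in zip(alpha, c))
```

Keeping `c` up to date incrementally is what makes the per-iteration cost linear rather than quadratic in the number of training points.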

3.4. Putting it Together 

We are now ready to summarize the SBP algorithm.
Starting from $w^{(0)} = 0$ (so both $\alpha^{(0)}$ and all
responses $c^{(0)}$ are zero), each iteration proceeds as
follows:

1. Find $p^*$ by finding the "water level" $\gamma$ from the
responses (Section 3.2), and taking $p^*$ to be uniform
on those indices for which $c_i < \gamma$.

2. Sample $j \sim p^*$.

3. Update $w^{(t+1)} \leftarrow \mathcal{P}(w^{(t)} + \eta_t y_j \Phi(x_j))$,
where $\mathcal{P}$ projects onto the unit ball and
$\eta_t$ is the step size. This is done by first increasing
$\alpha_j \leftarrow \alpha_j + \eta_t$ and updating the
responses as in Equation 3.4, then calculating $\|w\|$
(Section 3.3) and scaling $\alpha$ and $c$ by
$\min(1, 1/\|w\|)$.

Updating the responses as in Equation 3.4 requires
$O(n)$ kernel evaluations (the most computationally
expensive part) and all other operations require $O(n)$
scalar arithmetic operations.
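The three steps can be put together in a short sketch. This is not the authors' implementation: the step size schedule $\eta_t = 1/\sqrt{t}$, the sorting-based water level, the running average of the coefficients and all function names are assumptions made for illustration.

```python
import math
import random

def water_level(c, budget):
    """Level gamma with sum_i max(0, gamma - c_i) == budget (sorting-based)."""
    s = sorted(c)
    prefix = 0.0
    for k in range(1, len(s) + 1):
        prefix += s[k - 1]
        gamma = (budget + prefix) / k
        if k == len(s) or gamma <= s[k]:
            return gamma

def sbp(X, y, kernel, nu, T, seed=0):
    """Run T SBP-style iterations; returns the averaged coefficients."""
    rng = random.Random(seed)
    n = len(X)
    alpha, c, alpha_bar = [0.0] * n, [0.0] * n, [0.0] * n
    for t in range(1, T + 1):
        # Step 1: p* is uniform on the indices below the water level.
        gamma = water_level(c, n * nu)
        support = [i for i in range(n) if c[i] < gamma]
        if not support:  # nu == 0: use the minimizing point(s)
            support = [i for i in range(n) if c[i] <= gamma]
        # Step 2: sample j from p*.
        j = rng.choice(support)
        # Step 3: step on alpha_j and update the responses (n kernel evaluations).
        eta = 1.0 / math.sqrt(t)  # ASSUMED step size schedule
        alpha[j] += eta
        for i in range(n):
            c[i] += eta * y[i] * y[j] * kernel(X[i], X[j])
        # Projection: ||w||^2 = sum_i alpha_i c_i; rescale if outside the unit ball.
        norm = math.sqrt(max(0.0, sum(a * ci for a, ci in zip(alpha, c))))
        if norm > 1.0:
            alpha = [a / norm for a in alpha]
            c = [ci / norm for ci in c]
        # Running average of the iterates.
        alpha_bar = [m + (a - m) / t for m, a in zip(alpha_bar, alpha)]
    return alpha_bar

def decision_value(alpha, X, y, kernel, x):
    """<w, Phi(x)> for w = sum_i alpha_i y_i Phi(x_i)."""
    return sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alpha, X, y))
```

A call such as `sbp(X, y, kernel, nu=0.1, T=1000)` returns coefficients that can be passed to `decision_value` to classify new points by sign.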

Since at each iteration we are just updating using an 
unbiased estimator of a supergradient, we can rely on 
the standard analysis of stochastic gradient descent to 
bound the suboptimality after T iterations: 

Lemma 3.2 (Proof in Appendix C). For any $T, \delta > 0$,
after $T$ iterations of the Stochastic Batch Perceptron,
with probability at least $1 - \delta$, the average iterate
$\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$ (corresponding
to $\bar{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^{(t)}$)
satisfies:
$f(\bar{w}) \ge \sup_{\|w\| \le 1} f(w) - O\left(\sqrt{\frac{1}{T} \log \frac{1}{\delta}}\right)$.

Since each iteration is dominated by $n$ kernel
evaluations, and thus takes linear time (we take a kernel
evaluation to require $O(1)$ time), the overall runtime
to achieve $\epsilon$ suboptimality for Problem 2.3 is
$O(n/\epsilon^2)$.

3.5. Learning Runtime 

The previous section has given us the runtime for 
obtaining a certain suboptimality of Problem 2.3. 
However, since the suboptimality in this objective 
is not directly comparable to the suboptimality of 
other scalarizations, e.g. Problem 2.2, we follow 
Bottou & Bousquet (2008); Shalev-Shwartz & Srebro 
(2008), and analyze the runtime required to achieve a 
desired generalization performance, instead of that to 
achieve a certain optimization accuracy on the empir- 
ical optimization problem. 

Recall that our true learning objective is to find a
predictor with low generalization error
$\mathcal{L}_{0/1}(w) = \Pr_{(x,y)} \{ y \langle w, \Phi(x) \rangle \le 0 \}$
with respect to some unknown distribution over $x, y$,
based on a training set drawn i.i.d. from this
distribution. We assume that there exists some (unknown)
predictor $u$ that has norm $\|u\|$ and low expected hinge
loss
$\mathcal{L}^* = \mathcal{L}(u) = E[\ell(y \langle u, \Phi(x) \rangle)]$
(otherwise, there is no point in training a SVM), and
analyze the runtime to find a predictor $w$ with
generalization error
$\mathcal{L}_{0/1}(w) \le \mathcal{L}^* + \epsilon$.

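In order to understand the SBP runtime, we must
determine both the required sample size and optimization
accuracy. Following Hazan et al. (2011), and based on the
generalization guarantees of Srebro et al. (2010), using a
sample of size:

$$n = O\left( \left( \frac{\mathcal{L}^* + \epsilon}{\epsilon} \right) \frac{\|u\|^2}{\epsilon} \right) \qquad (3.5)$$

and optimizing the empirical SVM bi-criterion Problem 2.1
such that:

$$\|w\| \le 2\|u\| \; ; \quad \hat{\mathcal{L}}(w) - \hat{\mathcal{L}}(u) \le \epsilon/2 \qquad (3.6)$$

suffices to ensure
$\mathcal{L}_{0/1}(w) \le \mathcal{L}^* + \epsilon$ with high
probability. Referring to Lemma 2.1, Equation 3.6 will be
satisfied for $\tilde{w}/\gamma$ as long as $\tilde{w}$
optimizes the objective of Problem 2.3 to within:

$$\frac{\epsilon/2}{\|u\| \left( \hat{\mathcal{L}}(u) + \epsilon/2 \right)} \ge \Omega\left( \frac{\epsilon}{\|u\| \left( \mathcal{L}(u) + \epsilon \right)} \right) \qquad (3.7)$$

where the inequality holds with high probability for
the sample size of Equation 3.5. Plugging this sample
size and the optimization accuracy of Equation 3.7 into
the SBP runtime of $O(n/\epsilon^2)$ yields the overall
runtime:

$$O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right)^3 \frac{\|u\|^4}{\epsilon} \right) \qquad (3.8)$$

for the SBP to find $\tilde{w}$ such that its rescaling
satisfies $\mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon$
with high probability.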

In the realizable case, where $\mathcal{L}^* = 0$, or more
generally when we would like to reach $\mathcal{L}^*$ to
within a small constant multiplicative factor, we have
$\epsilon = \Omega(\mathcal{L}^*)$, the first factor in
Equation 3.8 is a constant, and the runtime simplifies to
$O(\|u\|^4/\epsilon)$. As we will see in Section 4, this is
a better guarantee than that enjoyed by any other SVM
optimization approach.
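As a worked check of this substitution, the following sketch assumes that the sample size of Equation 3.5 scales as $\left(\frac{\mathcal{L}^* + \epsilon}{\epsilon}\right) \frac{\|u\|^2}{\epsilon}$ and writes $\epsilon'$ (a symbol introduced here for convenience) for the optimization accuracy of Equation 3.7. Constants and log factors are ignored.

```latex
\epsilon' = \Omega\!\left( \frac{\epsilon}{\|u\| \, (\mathcal{L}^* + \epsilon)} \right)
\;\Longrightarrow\;
\frac{n}{\epsilon'^2}
= O\!\left( \frac{\mathcal{L}^* + \epsilon}{\epsilon} \cdot \frac{\|u\|^2}{\epsilon} \right)
  \cdot
  O\!\left( \|u\|^2 \left( \frac{\mathcal{L}^* + \epsilon}{\epsilon} \right)^{2} \right)
= O\!\left( \left( \frac{\mathcal{L}^* + \epsilon}{\epsilon} \right)^{3} \frac{\|u\|^4}{\epsilon} \right)
```

If $\epsilon \ge a \mathcal{L}^*$ for some constant $a > 0$, then $(\mathcal{L}^* + \epsilon)/\epsilon \le 1 + 1/a$, so the cubed factor is bounded by a constant and only the $\|u\|^4/\epsilon$ term remains.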

3.6. Including an Unregularized Bias 

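It is possible to use the SBP to train SVMs with a
bias term, i.e. where one seeks a predictor of the form
$x \mapsto (\langle w, \Phi(x) \rangle + b)$. We then take
stochastic gradient steps on:

$$f(w) = \max_{b \in \mathbb{R},\; \xi \succeq 0,\; \mathbf{1}^T \xi \le n\nu} \; \min_{p \in \Delta^n} \; \sum_{i=1}^{n} p_i \left( y_i \langle w, \Phi(x_i) \rangle + y_i b + \xi_i \right) \qquad (3.9)$$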



Lemma 3.1 still holds, but we must now find minimax
optimal $p^*$ and $b^*$. This can be accomplished
using a modified "water filling" involving two basins,
one containing the positively-classified examples, and
the other the negatively-classified ones. As in the
case without an unregularized bias, this can be
accomplished in $O(n)$ time; see Appendix B for details.

Table 1. Upper bounds, up to log factors, on the runtime
(number of kernel evaluations) required to achieve
$\mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon$.

Method                        Overall
SBP                           $\left(\frac{\mathcal{L}(u) + \epsilon}{\epsilon}\right)^3 \frac{\|u\|^4}{\epsilon}$
Dual Decomp.                  $\left(\frac{\mathcal{L}(u) + \epsilon}{\epsilon}\right)^2 \frac{\|u\|^4}{\epsilon^2}$
SGD on $\hat{\mathcal{L}}$    $\left(\frac{\mathcal{L}(u) + \epsilon}{\epsilon}\right) \frac{\|u\|^4}{\epsilon^3}$

4. Relationship to Other Methods 

We discuss the relationship between the SBP and sev- 
eral other SVM optimization approaches, highlighting 
similarities and key differences, and comparing their 
performance guarantees. 

4.1. SIMBA 

Recently, Hazan et al. (2011) presented SIMBA, a 
method for training linear SVMs based on the same 
"slack constrained" scalarization (Problem 2.3) we use 
here. SIMBA also fully optimizes over the slack variables
$\xi$ at each iteration, but differs in that, instead
of fully optimizing over the distribution p (as the SBP 
does), SIMBA updates p using a stochastic mirror de- 
scent step. The predictor w is then updated, as in 
the SBP, using a random example drawn according 
to p. A SBP iteration is thus in a sense more "thorough"
than a SIMBA iteration. The SBP theoretical
guarantee (Lemma 3.2) is correspondingly better by 
a logarithmic factor (compare to Hazan et al. (2011, 
Theorem 4.3)). All else being equal, we would prefer 
performing a SBP iteration over a SIMBA iteration. 

For linear SVMs, a SIMBA iteration can be performed
in time $O(n + d)$. However, fully optimizing $p$ as
described in Section 3.2 requires the responses $c_i$, and
calculating or updating all $n$ responses would require
time $O(nd)$. In this setting, therefore, a SIMBA
iteration is much more efficient than a SBP iteration.

In the kernel setting, calculating even a single response
requires $O(n)$ kernel evaluations, which is the same cost
as updating all responses after a change to a single
coordinate $\alpha_i$ (Section 3.3). This makes the
responses essentially "free", and gives an advantage to
methods such as the SBP (and the dual decomposition methods
discussed below) which make use of the responses.

Although SIMBA is preferable for linear SVMs, the 
SBP is preferable for kernelized SVMs. It should also 
be noted that SIMBA relies heavily on having direct 
access to features, and that it is therefore not obvious 
how to apply it directly in the kernel setting. 

4.2. Pegasos and SGD on $\hat{\mathcal{L}}(w)$

Pegasos (Shalev-Shwartz et al., 2011) is a SGD 
method optimizing the regularized scalarization of 
Problem 2.2. Alternatively, one can perform SGD on 
$\hat{\mathcal{L}}(w)$ subject to the constraint that $\|w\| \le B$, yielding
similar learning guarantees (e.g. Zhang, 2004). At
each iteration, these algorithms pick an example uniformly
at random from the training set. If the margin
constraint is violated on the example, $w$ is updated by
adding to it a scaled version of $y_i \Phi(x_i)$. Then, $w$ is
scaled and possibly projected back to $\|w\| \le B$. The
actual update performed at each iteration is thus very 
similar to that of the SBP. The main difference is that 
in Pegasos and related SGD approaches, examples are 
picked uniformly at random, unlike the SBP which 
samples from the set of violating examples. 

In a linear SVM, where $\Phi(x_i) \in \mathbb{R}^d$ are
given explicitly, each Pegasos (or SGD on
$\hat{\mathcal{L}}(w)$) iteration is extremely simple and
requires runtime which is linear in the dimensionality of
$\Phi(x_i)$. A SBP update would require calculating and
referring to all $O(n)$ responses. However, with access
only to kernel evaluations, even a Pegasos-type update
requires either considering all support vectors, or
alternatively updating all responses, and might also take
$O(n)$ time, just like the much "smarter" SBP step.

To understand the learning runtime of such methods
in the kernel setting, recall that SGD converges to an
$\epsilon$-accurate solution of the optimization problem
after at most $\|u\|^2/\epsilon^2$ iterations. Therefore,
the overall runtime is $n\|u\|^2/\epsilon^2$. Combining this
with Equation 3.5 yields that the runtime required by SGD
to achieve a learning accuracy of $\epsilon$ is
$O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right) \frac{\|u\|^4}{\epsilon^3} \right)$.
When $\epsilon = \Omega(\mathcal{L}^*)$, this scales as
$1/\epsilon^3$, compared with the $1/\epsilon$ scaling for
the SBP (see also Table 1).

4.3. Dual Decomposition Methods 

Many of the most popular packages for optimizing
kernel SVMs, including LIBSVM (Chang & Lin,
2001) and SVM-Light (Joachims, 1998), use
dual-decomposition approaches. This family of algorithms
works on the dual of the scalarization 2.2, given by:

$$\max_{\alpha \in [0, \frac{1}{\lambda n}]^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (4.1)$$

and proceeds by iteratively choosing a small working
set of dual variables $\alpha_i$, and then optimizing over
these variables while holding all other dual variables
fixed. At an extreme, SMO (Platt, 1998) uses a working set
of the smallest possible size (two in problems with an
unregularized bias, one in problems without). Most
dual decomposition approaches rely on having access
to all the responses $c_i$ (as in the SBP), and employ
some heuristic to select variables that are likely to
enable a significant increase in the dual objective.

On an objective without an unregularized bias the
structure of SMO is similar to the SBP: the responses
$c_i$ are used to choose a single point $j$ in the training
set, then $\alpha_j$ is updated, and finally the responses
are updated accordingly. There are two important
differences, though: how the training example to update is
chosen, and how the change in $\alpha_j$ is performed.

SMO updates $\alpha_j$ so as to exactly optimize the dual
Problem 4.1, while the SBP takes a step along $\alpha_j$ so
as to improve the primal Problem 2.3. Dual feasibility
is not maintained, so the SBP has more freedom to use
large coefficients on a few support vectors, potentially
resulting in sparser solutions.

The use of heuristics to choose the training example to
update makes SMO very difficult to analyze. Although
it is known to converge linearly after some number of
iterations (Chen et al., 2006), the number of iterations
required to reach this phase can be very large (see a
detailed discussion in Appendix E). To the best of our
knowledge, the most satisfying analysis for a dual
decomposition method is the one given in Hush et al.
(2006). In terms of learning runtime, this analysis
yields a runtime of
$O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right)^2 \frac{\|u\|^4}{\epsilon^2} \right)$
to guarantee
$\mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon$. When
$\epsilon = \Omega(\mathcal{L}^*)$, this runtime scales as
$1/\epsilon^2$, compared with the $1/\epsilon$ guarantee
for the SBP.
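To make the comparison of guarantees concrete, the sketch below evaluates the three bounds numerically. It assumes the bounds have the forms $\left(\frac{\mathcal{L}^* + \epsilon}{\epsilon}\right)^3 \frac{\|u\|^4}{\epsilon}$ (SBP), $\left(\frac{\mathcal{L}^* + \epsilon}{\epsilon}\right)^2 \frac{\|u\|^4}{\epsilon^2}$ (dual decomposition) and $\left(\frac{\mathcal{L}^* + \epsilon}{\epsilon}\right) \frac{\|u\|^4}{\epsilon^3}$ (SGD), with constants and log factors dropped; the function name and dictionary keys are hypothetical.

```python
def runtime_bounds(loss_star, eps, norm_u):
    """Evaluate the three upper bounds (constants and log factors dropped)."""
    ratio = (loss_star + eps) / eps
    return {
        "SBP": ratio ** 3 * norm_u ** 4 / eps,
        "dual_decomposition": ratio ** 2 * norm_u ** 4 / eps ** 2,
        "SGD": ratio * norm_u ** 4 / eps ** 3,
    }
```

In the realizable case (`loss_star = 0`) the ratio is 1, so the three values differ only through the power of `eps`, which is the regime in which the SBP guarantee is strongest relative to the others.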

4.4. Stochastic Dual Coordinate Ascent 

Another variant of the dual decomposition approach
is to choose a single $\alpha_i$ randomly at each iteration
and update it so as to optimize Equation 4.1 (Hsieh et al.,
2008). The advantage here is that we do not need to
use all of the responses at each iteration, so that if it is
easy to calculate responses on-demand, as in the case
of linear SVMs, each SDCA iteration can be calculated
in time $O(d)$ (Hsieh et al., 2008). In a sense, SDCA
relates to SMO in much the same way that Pegasos relates
to the SBP: SDCA and Pegasos are preferable on
linear SVMs since they choose working points at random;
SMO and the SBP choose working points based
on more information (namely, the responses), which
are unnecessarily expensive to compute in the linear
case, but, as discussed earlier, are essentially "free"
in kernelized implementations. Pegasos and the SBP
both work on the primal (though on different
scalarizations), while SMO and SDCA work on the dual and
maintain dual feasibility.

The current best analysis of the runtime of SDCA is
not satisfying, and yields the bound $n/(\lambda\epsilon)$
on the number of iterations, which is a factor of $n$
larger than the bound for Pegasos. Since the cost of each
iteration is the same, this yields a significantly worse
guarantee. We do not know if a better guarantee can be
derived for SDCA. See a detailed discussion in Appendix E.
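For contrast with the SBP update, a single-coordinate dual step can be sketched as below. The sketch assumes the dual has the form $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ with box constraints $0 \le \alpha_i \le 1/(\lambda n)$ and no bias term; the Gram matrix `G` is precomputed only because the example is tiny, and all names are hypothetical.

```python
import math
import random

def gaussian_kernel(a, b, sigma2=1.0):
    """Example kernel (an assumption made for the illustration)."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma2))

def dual_objective(alpha, y, G):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j G_ij."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * G[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def coordinate_step(alpha, c, y, G, j, upper):
    """Exactly maximize the dual over alpha_j in [0, upper], others held fixed."""
    # The partial derivative is 1 - c_j and the curvature is -G_jj.
    target = alpha[j] + (1.0 - c[j]) / G[j][j]
    new = min(upper, max(0.0, target))  # clip to the box constraint
    delta = new - alpha[j]
    alpha[j] = new
    for i in range(len(alpha)):  # keep the responses in sync
        c[i] += delta * y[i] * y[j] * G[i][j]

def sdca(X, y, kernel, lam, T, seed=0):
    """Pick a coordinate uniformly at random and optimize it, T times."""
    rng = random.Random(seed)
    n = len(X)
    G = [[kernel(a, b) for b in X] for a in X]
    alpha, c = [0.0] * n, [0.0] * n
    for _ in range(T):
        coordinate_step(alpha, c, y, G, rng.randrange(n), 1.0 / (lam * n))
    return alpha
```

Unlike the SBP step, the coordinate is chosen without consulting the responses, and each step keeps $\alpha$ dual feasible.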

4.5. The Online Perceptron 

We have so far considered only the problem of opti- 
mizing the bi-criterion SVM objective of Problem 2.1. 
However, because the online Perceptron achieves the 
same form of learning guarantee (despite not optimiz- 
ing the bi-criterion objective), it is reasonable to con- 
sider it, as well. 

The online Perceptron makes a single pass over the
training set. At each iteration, if $w$ errs on the
point under consideration (i.e.
$y_i \langle w, \Phi(x_i) \rangle \le 0$), then
$y_i \Phi(x_i)$ is added into $w$. Let $M$ be the number of
mistakes made by the Perceptron on the sequence of
examples. Support vectors are added only when a mistake
is made, and so each iteration of the Perceptron
involves at most $M$ kernel evaluations. The total
runtime is therefore $Mn$.
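The single-pass procedure and its kernel-evaluation count can be sketched as follows. This is an illustrative sketch only; the function names and the linear kernel used in the example are assumptions.

```python
def linear_kernel(a, b):
    """Example kernel (an assumption made for the illustration)."""
    return sum(u * v for u, v in zip(a, b))

def kernel_perceptron(X, y, kernel):
    """Single pass of the online Perceptron; returns (mistake indices, kernel evals)."""
    support = []  # indices on which a mistake was made
    evals = 0
    for i in range(len(X)):
        # <w, Phi(x_i)> with w = sum over past mistakes of y_j Phi(x_j).
        score = sum(y[j] * kernel(X[j], X[i]) for j in support)
        evals += len(support)  # at most M evaluations per example
        if y[i] * score <= 0:  # mistake: add y_i Phi(x_i) into w
            support.append(i)
    return support, evals
```

Since the support set never holds more than $M$ points, the evaluation count returned is at most $Mn$.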

While the Perceptron is an online learning algorithm, 
it can also be used for obtaining guarantees on the 
generalization error using an online-to-batch conver- 
sion (e.g. (Cesa-Bianchi et al., 2001)). 

From a bound on the number of mistakes $M$ (e.g.
Shalev-Shwartz (2007, Corollary 5)), it is possible to
show that the expected number of mistakes the Perceptron
makes is upper bounded by
$n\mathcal{L}(u) + \|u\| \sqrt{n\mathcal{L}(u)} + \|u\|^2$.
This implies that the total runtime required
by the Perceptron to achieve
$\mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon$ is
$O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right)^3 \frac{\|u\|^4}{\epsilon} \right)$.
This is of the same order as the bound we have derived for
SBP. However, the Perceptron does not converge to a Pareto
optimal solution to the bi-criterion Problem 2.1, and
therefore cannot be considered a SVM optimization
procedure. Furthermore, the online Perceptron
generalization analysis relies on an "online-to-batch"
conversion technique (e.g. Cesa-Bianchi et al., 2001), and
is therefore valid only for a single pass over the data. If
we attempt to run the Perceptron for multiple passes,
then it might begin to overfit uncontrollably. Although
the worst-case theoretical guarantee obtained after a
single pass is indeed similar to that for an optimum of
the SVM objective, in practice an optimum of the
empirical SVM optimization problem does seem to have
significantly better generalization performance.

5. Experiments 

We compared the SBP to other SVM optimization
approaches on the datasets in Table 2. We compared
to Pegasos (Shalev-Shwartz et al., 2011), SDCA
(Hsieh et al., 2008), and SMO (Platt, 1998) with a
second order heuristic for working point selection
(Fan et al., 2005). These approaches work on the
regularized formulation of Problem 2.2 or its dual
(Problem 4.1). To enable comparison, the parameter
$\nu$ for the SBP was derived from $\lambda$ as
$\|w^*\| \nu = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i \langle w^*, \Phi(x_i) \rangle)$,
where $w^*$ is the known (to us) optimum.

We first compared the methods on a SVM formula- 
tion without an unregularized bias, since Pegasos and 
SDCA do not naturally handle one. So that this 
comparison would be implementation-independent, we 
measure performance in terms of the number of ker- 
nel evaluations. As can be seen in Figure 2, the SBP 
outperforms Pegasos and SDCA, as predicted by the 
upper bounds. The SMO algorithm has a dramatically 
different performance profile, in line with the known 
analysis: it makes relatively little progress, in terms 
of generalization error, until it reaches a certain criti- 
cal point, after which it converges rapidly. Unlike the 
other methods, terminating SMO early in order to ob- 
tain a cruder solution does not appear to be advisable. 

We also compared to the online Perceptron algorithm. 
Although use of the Perceptron is justified for non- 
separable data only if run for a single pass over the 
training set, we did continue running for multiple 
passes. The Perceptron's generalization performance 
is similar to that of the SBP for the first epoch, but 
the SBP continues improving over additional passes. 
As discussed in Section 4.5, the Perceptron is unsafe 
and might overfit after the first epoch, an effect which 
is clearly visible on the Adult dataset. 

To give a sense of actual runtime, we compared our
implementation of the SBP^1 to the SVM package
LIBSVM, running on an Intel E7500 processor. We
allowed an unregularized bias (since that is what
LIBSVM uses), and used the parameters in Table
2. For these experiments, we replaced the Reuters
dataset with the version of the Forest dataset used
by Nguyen et al. (2010), using their parameters.
LIBSVM converged to a solution with 14.9% error in 195 s
on Adult, 0.44% in 1980 s on MNIST, and 1.8% in 35
hours on Forest. In one-quarter of each of these
runtimes, SBP obtained 15.0% error on Adult, 0.46% on
MNIST, and 1.6% on Forest. These results of course
depend heavily on the specific stopping criterion used.

^1 Source code is available from
http://ttic.uchicago.edu/~cotter/projects/SBP

6. Summary and Discussion 

The Stochastic Batch Perceptron is a novel approach 
for training kernelized SVMs. The SBP fares well 
empirically, and, as summarized in Table 1, our run- 
time guarantee for the SBP is the best of any existing 
guarantee for kernelized SVM training. An interesting 
open question is whether this runtime is optimal, i.e.
whether any algorithm relying only on black-box kernel
accesses must perform
$\Omega\left( \left( \frac{\mathcal{L}^* + \epsilon}{\epsilon} \right)^3 \frac{\|u\|^4}{\epsilon} \right)$
kernel evaluations.

As with other stochastic gradient methods, deciding 
when to terminate SBP optimization is an open issue. 
The most practical approach seems to be to terminate 
when a holdout error stabilizes. We should note that 
even for methods where the duality gap can be used 
(e.g. SMO), this criterion is often too strict, and the 
use of cruder criteria may improve training time.

A. Additional Experiments

While our focus in this paper is on optimization of the 
kernel SVM objective, and not on the broader prob- 
lem of large-scale learning, one may wonder how well 
the SBP compares to techniques which accelerate the 
training of kernel SVMs through approximation. One 
such is the random Fourier projection algorithm of
Rahimi & Recht (2007), which can be used to transform
a kernel SVM problem into an approximately-equivalent
linear SVM. The resulting problem may
then be optimized using one of the many existing fast 
linear SVM solvers, such as Pegasos, SDCA or SIMBA. 
Unlike methods (such as the SBP) which rely only on 
black-box kernel accesses, Rahimi and Recht's projec- 
tion technique can only be applied on a certain class 
of kernel functions (shift-invariant kernels), of which 
the Gaussian kernel is a member. 

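For $d$-dimensional feature vectors, and using a Gaussian
kernel with parameter $\sigma^2$, Rahimi and Recht's
approach is to sample $v_1, \dots, v_k \in \mathbb{R}^d$
independently according to $v_i \sim \mathcal{N}(0, I)$,
and then define the mapping
$\mathcal{P} : \mathbb{R}^d \to \mathbb{R}^{2k}$ as:

$$\mathcal{P}(x)_{2i} = \frac{1}{\sqrt{k}} \cos\left( \frac{1}{\sigma} \langle v_i, x \rangle \right), \qquad \mathcal{P}(x)_{2i+1} = \frac{1}{\sqrt{k}} \sin\left( \frac{1}{\sigma} \langle v_i, x \rangle \right)$$

Then $\langle \mathcal{P}(x_i), \mathcal{P}(x_j) \rangle \approx K(x_i, x_j)$,
with the quality of this approximation improving with
increasing $k$ (see Rahimi & Recht (2007, Claim 1) for
details).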

Notice that computing each pair of Fourier features
requires computing the $d$-dimensional inner product
$\langle v, x \rangle$. For comparison, let us write the
Gaussian kernel in the following form:

$$K(x_i, x_j) = \exp\left( -\frac{1}{2\sigma^2} \left( \|x_i\|^2 + \|x_j\|^2 - 2 \langle x_i, x_j \rangle \right) \right)$$

The norms $\|x_i\|$ may be cheaply precomputed, so the
dominant cost of performing a single Gaussian kernel
evaluation is, likewise, that of the $d$-dimensional inner
product $\langle x_i, x_j \rangle$.
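The correspondence between Fourier features and the Gaussian kernel can be checked numerically with a short sketch. It assumes the kernel $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, directions drawn from a standard normal distribution, and features $\cos$ and $\sin$ of $\langle v, x \rangle / \sigma$ scaled by $1/\sqrt{k}$; this parametrization and all function names are assumptions made for the illustration.

```python
import math
import random

def gaussian_kernel(a, b, sigma2):
    """exp(-(||a||^2 + ||b||^2 - 2<a, b>) / (2 sigma^2))."""
    na = sum(v * v for v in a)
    nb = sum(v * v for v in b)
    ab = sum(u * v for u, v in zip(a, b))
    return math.exp(-(na + nb - 2.0 * ab) / (2.0 * sigma2))

def sample_directions(d, k, rng):
    """k independent d-dimensional standard normal vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def fourier_features(x, V, sigma2):
    """Map x to 2k features; each pair costs one d-dimensional inner product."""
    scale = 1.0 / math.sqrt(len(V))
    sigma = math.sqrt(sigma2)
    feats = []
    for v in V:
        z = sum(vi * xi for vi, xi in zip(v, x)) / sigma
        feats.append(scale * math.cos(z))
        feats.append(scale * math.sin(z))
    return feats
```

The inner product of two feature vectors averages $\cos$ of the difference of the projections over the $k$ sampled directions, which concentrates around the kernel value as $k$ grows.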

This observation suggests that the computational cost 
of the use of Fourier features may be directly compared 
with that of a kernel-evaluation-based SVM optimizer 
in terms of d-dimensional inner products. Figure 3 
contains such a comparison. In this figure, the computational
cost of a 2k-dimensional Fourier linearization
is taken to be the cost of computing $\Phi(x_i)$ on the
entire training set (kn inner products, where n is the
number of training examples); we ignore the cost of
optimizing the resulting linear SVM entirely. The plotted
testing error is that of the optimum of the resulting
linear SVM problem, which approximates the original
kernel SVM. We can see that at least on Reuters and
MNIST, the SBP is preferable to (i.e. faster than)
approximating the kernel with random Fourier features.

B. Implementation Details 

We begin this appendix by providing complete pseudo- 
code, which may be found in Algorithm 1, for the 
SBP algorithm which we outlined in Section 3.4. This 
implementation requires that we be able to find a 
minimax-optimal probability distribution p* to the ob- 
jective of Equation 3.1. 

As was discussed in Section 3.2, in a problem without
an unregularized bias, such a probability distribution
can be derived from the "water level" $\gamma$, which can
be found in $O(n)$ time using Algorithm 2. This algorithm
works by subdividing the set of responses into
those less than, equal to and greater than a pivot value
(if one uses the median, which can be found in linear
time using e.g. the median-of-medians algorithm
(Blum et al., 1973), then the overall runtime will be linear in
n). Then, it calculates the size, minimum and sum of
each of these subsets, from which the total volume of
the water required to cover the subsets can be easily
calculated. It then recurses into the subset containing
the point at which a volume of $n\nu$ just suffices to cover
the responses, and continues until $\gamma$ is found.
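
As an illustration of what the water level is, the sketch below computes it by sorting, which takes $O(n \log n)$ time. This is a simpler substitute for, and not an implementation of, the linear-time divide-and-conquer procedure of Algorithm 2. It assumes the water level is the value $\gamma$ for which $\sum_i \max(0, \gamma - c_i)$ equals the total volume; the function name mirrors the pseudocode but the implementation is only illustrative.

```python
def find_gamma(responses, volume):
    """Return gamma such that sum(max(0, gamma - c) for c in responses) == volume."""
    if volume <= 0 or not responses:
        raise ValueError("need a positive volume and at least one response")
    heights = sorted(responses)
    prefix_sum = 0.0
    for count, height in enumerate(heights, start=1):
        prefix_sum += height
        # Level reached if the water covers exactly the `count` lowest responses.
        level = (volume + prefix_sum) / count
        # Accept once the level does not rise above the next uncovered response.
        if count == len(heights) or level <= heights[count]:
            return level

print(find_gamma([0.0, 1.0, 3.0], 2.0))  # the two lowest responses are covered
```

In the example, a volume of 2 covers the responses 0 and 1 up to a level of 1.5, while the response 3 stays dry.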

In Section 3.6, we mentioned that a similar result holds 
for the objective of Equation 3.9, which adds an un- 
regularized bias. 

As before, finding the water level $\gamma$ reduces to finding
minimax-optimal values of $p^*$, $\xi^*$ and $b^*$. The characterization
of such solutions is similar to that in the
case without an unregularized bias. In particular, for
a fixed value of b, we may still think about "pouring
water into a basin", except that the height of the basin
is now $c_i + y_i b$, rather than $c_i$.

When b is not fixed it is easier to think of two basins,
one containing the positive examples, and the other
the negative examples. These basins will be filled with
water of a total volume of $n\nu$, to a common water level
$\gamma$. The relative heights of the two basins are determined
by b: increasing b will raise the basin containing
the positive examples, while lowering that containing
the negative examples by the same amount. This is
illustrated in Figure 4.

It remains only to determine what characterizes a
minimax-optimal value of b. Let $k^+$ and $k^-$ be the
number of elements covered by water in the positive
and negative basins, respectively, for some b. If
$k^+ > k^-$, then raising the positive basin and lowering
the negative basin by the same amount (i.e. increasing
b) will raise the overall water level, showing that
b is not optimal. Hence, for an optimal b, water must
cover an equal number of indices in each basin. Similar
reasoning shows that an optimal $p^*$ must place equal
probability mass on each of the two classes.

Figure 3. Classification error on the held-out testing set (linear scale) vs. computational cost measured in units of d-dimensional
inner products (where the training vectors satisfy $x \in \mathbb{R}^d$) (log scale), and averaged over ten runs. For the
Fourier features, the computational cost (horizontal axis) is that of computing $k \in \{1, 2, 4, 8, \dots\}$ pairs of Fourier features
over the entire training set, while the test error is that of the optimal classifier trained on the resulting linearized SVM
objective.

Once more, the resulting problem is amenable to a
divide-and-conquer approach. The water level $\gamma$ and
bias b will be found in $O(n)$ time by Algorithm 3, provided
that the partition function chooses the median
as the pivot.
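
The two-basin picture can be checked numerically with the sketch below. It is not Algorithm 3: it assumes that, for a fixed b, the water level over the shifted heights $c_i + y_i b$ is a concave function of b (it is a minimum of functions affine in b), and maximizes that level by ternary search. The bracket on b, the iteration count and the function names are illustrative assumptions, and the sorting-based helper replaces the linear-time routines.

```python
def find_gamma(responses, volume):
    """Sorting-based water level: sum(max(0, gamma - c)) == volume."""
    heights = sorted(responses)
    prefix_sum = 0.0
    for count, height in enumerate(heights, start=1):
        prefix_sum += height
        level = (volume + prefix_sum) / count
        if count == len(heights) or level <= heights[count]:
            return level

def find_gamma_and_bias(labels, responses, volume, iterations=200):
    """Maximize the water level over the bias b by ternary search (a sketch)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present, else the level is unbounded in b")

    def level(b):
        # Positive examples are raised by b, negative examples lowered by b.
        return find_gamma([c + y * b for c, y in zip(responses, labels)], volume)

    span = max(responses) - min(responses)
    lo, hi = -(span + volume + 1.0), span + volume + 1.0  # assumed bracket on b
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if level(m1) < level(m2):
            lo = m1
        else:
            hi = m2
    b = 0.5 * (lo + hi)
    return level(b), b

gamma, b = find_gamma_and_bias([1, 1, -1, -1], [0.0, 1.0, 0.0, 1.0], 1.0)
print(gamma, b)
```

The optimal bias need not be unique (the level can be flat over a range of b), so only the returned water level should be relied upon in this sketch.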

C. Proofs of Lemmas 3.1 and 3.2 

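Lemma 3.1. For any w, let $p^*, \xi^*$ be minimax optimal
for Equation 3.1. Then $\sum_{i=1}^n p_i^* y_i \Phi(x_i)$ is a supergradient
of $f(w)$ at w.

Proof. By the definition of f, for any $v \in \mathbb{R}^d$:

\[ f(w + v) = \max_{\xi \succeq 0,\; 1^T \xi \le n\nu} \; \min_{p \in \Delta^n} \sum_{i=1}^n p_i \left( y_i \langle w + v, \Phi(x_i)\rangle + \xi_i \right) \]

Substituting the particular value $p^*$ for p can only increase the RHS, so:

\[ f(w + v) \le \max_{\xi \succeq 0,\; 1^T \xi \le n\nu} \sum_{i=1}^n p_i^* \left( y_i \langle w + v, \Phi(x_i)\rangle + \xi_i \right) \]
\[ = \max_{\xi \succeq 0,\; 1^T \xi \le n\nu} \sum_{i=1}^n p_i^* \left( y_i \langle w, \Phi(x_i)\rangle + \xi_i \right) + \sum_{i=1}^n p_i^* y_i \langle v, \Phi(x_i)\rangle \]

Because $p^*$ is minimax-optimal at w:

\[ f(w + v) \le f(w) + \sum_{i=1}^n p_i^* y_i \langle v, \Phi(x_i)\rangle = f(w) + \left\langle v, \sum_{i=1}^n p_i^* y_i \Phi(x_i) \right\rangle \]

So $\sum_{i=1}^n p_i^* y_i \Phi(x_i)$ is a supergradient of f. □

Lemma 3.2. For any $T, \delta > 0$, after T iterations
of the Stochastic Batch Perceptron, with probability at
least $1 - \delta$, the average iterate $\bar{w} = \frac{1}{T} \sum_{t=1}^T w^{(t)}$
(corresponding to $\bar{\alpha} = \frac{1}{T} \sum_{t=1}^T \alpha^{(t)}$), satisfies:

\[ f(\bar{w}) \ge \sup_{\|w\| \le 1} f(w) - O\left( r \sqrt{\frac{1}{T} \log \frac{1}{\delta}} \right) \]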

Proof. Define $h = -\frac{1}{r} f$, where f is as in Equation 3.1.
Then the stated update rules constitute an instance
of Zinkevich's algorithm, in which steps are taken in
the direction of stochastic subgradients $g^{(t)}$ of h at $w^{(t)}$.

The claimed result follows directly from Zinkevich 
(2003, Theorem 1) combined with an online-to-batch 
conversion analysis in the style of Cesa-Bianchi et al. 
(2001, Lemma 1). □ 
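
The supergradient property of Lemma 3.1 can be checked numerically. The sketch below assumes an explicit, finite-dimensional feature map (the identity, i.e. a linear kernel), evaluates f(w) as the water level of the responses $y_i \langle w, x_i\rangle$ with volume $n\nu$, and takes $p^*$ to be uniform over the responses lying strictly below the water level. The data are random and all names are illustrative.

```python
import random

def find_gamma(responses, volume):
    """Sorting-based water level: sum(max(0, gamma - c)) == volume."""
    heights = sorted(responses)
    prefix_sum = 0.0
    for count, height in enumerate(heights, start=1):
        prefix_sum += height
        level = (volume + prefix_sum) / count
        if count == len(heights) or level <= heights[count]:
            return level

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def objective_and_supergradient(w, xs, ys, nu):
    """Return f(w) and sum_i p*_i y_i x_i with p* uniform on the covered set."""
    responses = [y * dot(w, x) for x, y in zip(xs, ys)]
    gamma = find_gamma(responses, len(xs) * nu)
    covered = [i for i, c in enumerate(responses) if c < gamma]
    grad = [0.0] * len(w)
    for i in covered:
        for j in range(len(w)):
            grad[j] += ys[i] * xs[i][j] / len(covered)
    return gamma, grad

rng = random.Random(1)
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(20)]
ys = [rng.choice([-1, 1]) for _ in xs]
w = [rng.gauss(0, 1) for _ in range(3)]
f_w, g = objective_and_supergradient(w, xs, ys, nu=0.2)

# Check f(w + v) <= f(w) + <v, g> for many random directions v.
violations = 0
for _ in range(200):
    v = [rng.gauss(0, 1) for _ in range(3)]
    shifted = [a + b for a, b in zip(w, v)]
    f_shifted = objective_and_supergradient(shifted, xs, ys, nu=0.2)[0]
    if f_shifted > f_w + dot(v, g) + 1e-9:
        violations += 1
print(violations)
```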
Algorithm 1 Stochastic gradient ascent algorithm for optimizing the kernelized version of Problem 2.3, as described in
Section 3.3. Here, e_i is the ith standard unit basis vector. The find_gamma subroutine finds the "water level" γ from the
vector of responses c and total volume nν.

optimize(n : N, x_1, ..., x_n : R^d, y_1, ..., y_n : {±1}, T_0 : N, T : N, ν : R_+, K : R^d × R^d → R_+)

    η_0 := 1 / sqrt(max_i K(x_i, x_i));
    α^(0) := 0^n; c^(0) := 0^n; r_0 := 0;
    for t := 1 to T
        η_t := η_0 / sqrt(t);
        γ := find_gamma(c^(t-1), nν);
        sample i ~ uniform {j : c_j^(t-1) < γ};
        α^(t) := α^(t-1) + η_t e_i;
        r_t^2 := r_(t-1)^2 + 2 η_t c_i^(t-1) + η_t^2 K(x_i, x_i);
        for j = 1 to n
            c_j^(t) := c_j^(t-1) + η_t y_i y_j K(x_i, x_j);
        if (r_t > 1) then
            α^(t) := (1/r_t) α^(t); c^(t) := (1/r_t) c^(t); r_t := 1;
    ᾱ := (1/T) Σ_{t=1}^T α^(t); c̄ := (1/T) Σ_{t=1}^T c^(t); γ := find_gamma(c̄, nν);
    return ᾱ/γ;
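
For readers who prefer executable code, the following is a hedged Python sketch of the update loop of Algorithm 1. It is an illustration under assumptions rather than the authors' implementation: the water level is found by sorting instead of Algorithm 2, the T_0 argument of the pseudocode signature is omitted because its role is not stated here, and the toy data, linear kernel and helper names are hypothetical.

```python
import math
import random

def find_gamma(responses, volume):
    """Sorting-based water level: sum(max(0, gamma - c)) == volume."""
    heights = sorted(responses)
    prefix_sum = 0.0
    for count, height in enumerate(heights, start=1):
        prefix_sum += height
        level = (volume + prefix_sum) / count
        if count == len(heights) or level <= heights[count]:
            return level

def sbp_train(xs, ys, kernel, nu, iterations, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    eta0 = 1.0 / math.sqrt(max(kernel(x, x) for x in xs))
    alpha = [0.0] * n      # coefficients of the current iterate
    c = [0.0] * n          # responses c_j = y_j <w, Phi(x_j)>
    r_sq = 0.0             # squared norm of the current iterate
    alpha_sum = [0.0] * n
    c_sum = [0.0] * n
    for t in range(1, iterations + 1):
        eta = eta0 / math.sqrt(t)
        gamma = find_gamma(c, n * nu)
        i = rng.choice([j for j in range(n) if c[j] < gamma])
        # Norm update uses the response of example i before the step.
        r_sq += 2.0 * eta * c[i] + eta * eta * kernel(xs[i], xs[i])
        alpha[i] += eta
        for j in range(n):  # n kernel evaluations per iteration
            c[j] += eta * ys[i] * ys[j] * kernel(xs[i], xs[j])
        if r_sq > 1.0:      # project back onto the unit ball
            r = math.sqrt(r_sq)
            alpha = [a / r for a in alpha]
            c = [v / r for v in c]
            r_sq = 1.0
        for j in range(n):
            alpha_sum[j] += alpha[j]
            c_sum[j] += c[j]
    alpha_bar = [a / iterations for a in alpha_sum]
    c_bar = [v / iterations for v in c_sum]
    gamma = find_gamma(c_bar, n * nu)
    if gamma <= 0.0:
        raise ValueError("non-positive water level; cannot rescale the solution")
    return [a / gamma for a in alpha_bar]

def predict(alpha, xs, ys, kernel, x):
    score = sum(a * y * kernel(xi, x) for a, xi, y in zip(alpha, xs, ys))
    return 1 if score >= 0 else -1

def linear_kernel(a, b):
    return sum(p * q for p, q in zip(a, b))

# Hypothetical, linearly separable toy data.
xs = [(2.0, 1.0), (1.5, -0.5), (1.0, 0.5), (-1.0, 0.3), (-2.0, -1.0), (-1.5, 0.2)]
ys = [1, 1, 1, -1, -1, -1]
alpha = sbp_train(xs, ys, linear_kernel, nu=0.1, iterations=200)
predictions = [predict(alpha, xs, ys, linear_kernel, x) for x in xs]
print(predictions)
```

Each iteration touches every training example once, which is the sense in which the method is a "batch" algorithm despite its stochasticity.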



D. Data-Laden Analyses 

We'll begin by presenting a bound on the sample size 
n required to guarantee good generalization perfor- 
mance (in terms of the 0/1 loss) for a classifier which 
is $\epsilon$-suboptimal in terms of the empirical hinge loss.
The following result, which follows from Srebro et al. 
(2010, Theorem 1), is a vital building block of the 
bounds derived in the remainder of this appendix: 

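Lemma D.1. Consider the expected 0/1 and hinge
losses:

\[ \mathcal{L}_{0/1}(w) = \mathbb{E}_{x,y}\left[ \mathbf{1}_{y \langle w, x\rangle \le 0} \right] \]
\[ \mathcal{L}(w) = \mathbb{E}_{x,y}\left[ \max(0, 1 - y \langle w, x\rangle) \right] \]

Let u be an arbitrary linear classifier, and suppose that
we sample a training set of size n, with n given by the
following equation, for parameters $B \ge \|u\|$, $\epsilon > 0$ and
$\delta \in (0, 1)$:

(D.1)

where $r \ge \|x\|$ is an upper bound on the radius of the
data. Then, with probability $1 - \delta$ over the i.i.d. training
sample $x_i, y_i : i \in \{1, \dots, n\}$, uniformly for all
linear classifiers w satisfying:

\[ \|w\| \le B \]
\[ \hat{\mathcal{L}}(w) - \hat{\mathcal{L}}(u) \le \epsilon \]

where $\hat{\mathcal{L}}$ is the empirical hinge loss:

\[ \hat{\mathcal{L}}(w) = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i \langle w, x_i\rangle) \]

we have that:

\[ \hat{\mathcal{L}}(u) \le \mathcal{L}(u) + \epsilon \]
\[ \mathcal{L}_{0/1}(w) \le \hat{\mathcal{L}}(u) + \epsilon \]

and in particular that:

\[ \mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + 2\epsilon \]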

In the remainder of this appendix, we will apply the 
above result to derive generalization bounds on the 
performance of the various algorithms under consider- 
ation, in the data-laden setting. 

D.l. Stochastic Batch Perceptron 

We will here present a more careful derivation of the 
main result of Section 3.5, bounding the generalization 
performance of the SBP. 

Algorithm 2 Divide-and-conquer algorithm for finding the "water level" γ from an array of responses C and total
volume nν. The partition function chooses a pivot value from the array it receives as an argument (the median would
be ideal), places all values less than the pivot at the start of the array, all values greater at the end, and returns the index
of the pivot in the resulting array.

find_gamma(C : R^n, nν : R)

    lower := 1; upper := n;
    lower_max := −∞; lower_sum := 0;
    while lower < upper
        middle := partition(C[lower : upper]);
        middle_max := max(lower_max, C[lower : (middle − 1)]);
        middle_sum := lower_sum + Σ C[lower : (middle − 1)];
        if middle_max · (middle − 1) − middle_sum > nν then
            upper := middle − 1;
        else
            lower := middle; lower_max := middle_max; lower_sum := middle_sum;
    return (nν − lower_max · (lower − 1) + lower_sum) / (lower − 1) + lower_max;

Theorem D.2. Let u be an arbitrary linear classifier
in the RKHS, let $\epsilon > 0$ be given, and suppose
that $K(x, x) \le r^2$ with probability 1. There exist
values of the training size n, iteration count T and
parameter $\nu$ such that Algorithm 1 finds a solution
$w = \sum_{i=1}^n \alpha_i y_i \Phi(x_i)$ satisfying:

\[ \mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon \]

where $\mathcal{L}_{0/1}$ and $\mathcal{L}$ are the expected 0/1 and hinge
losses, respectively, after performing the following
number of kernel evaluations:

with the size of the support set of w (the number of
nonzero elements in $\alpha$) satisfying:

the above statements holding with probability $1 - \delta$.
Proof. For a training set of size n, where: 

taking $B = 2\|u\|$ in Lemma D.1 gives that $\hat{\mathcal{L}}(u) \le \mathcal{L}(u) + \epsilon$
and $\mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + 2\epsilon$ with probability
$1 - \delta$ over the training sample, uniformly for all linear
classifiers w such that $\|w\| \le B$ and $\hat{\mathcal{L}}(w) - \hat{\mathcal{L}}(u) \le \epsilon$,
where $\hat{\mathcal{L}}$ is the empirical hinge loss. We will now show
that these inequalities are satisfied by the result of
Algorithm 1. Define:

\[ w^* = \operatorname{argmin}_{w : \|w\| \le \|u\|} \hat{\mathcal{L}}(w) \]

Because $w^*$ is a Pareto optimal solution of the bi-criterion
objective of Problem 2.1, if we choose the
parameter $\nu$ to the slack-constrained objective (Problem 2.3)
such that $\|w^*\| \nu = \hat{\mathcal{L}}(w^*)$, then the optimum
of the slack-constrained objective will be equivalent to
$w^*$ (Lemma 2.1). As was discussed in Section 3.5, we
will use Lemma 3.2 to find the number of iterations T
required to satisfy Equation 3.7 (with $u = w^*$). This
yields that, if we perform T iterations of Algorithm 1,
where T satisfies the following:

\[ T \ge O\left( \left( \frac{\hat{\mathcal{L}}(w^*) + \epsilon}{\epsilon} \right)^2 r^2 \|w^*\|^2 \log \frac{1}{\delta} \right) \tag{D.2} \]

then the resulting solution $w = \bar{w}/\gamma$ will satisfy:

\[ \|w\| \le 2\|w^*\| \]
\[ \hat{\mathcal{L}}(w) - \hat{\mathcal{L}}(w^*) \le \epsilon \]

with probability $1 - \delta$. That is:

\[ \|w\| \le 2\|w^*\| \le B \]

and:

\[ \hat{\mathcal{L}}(w) \le \hat{\mathcal{L}}(w^*) + \epsilon \le \hat{\mathcal{L}}(u) + \epsilon \]

These are precisely the bounds on $\|w\|$ and $\hat{\mathcal{L}}(w)$
which we determined (at the start of the proof) to
be necessary to permit us to apply Lemma D.1. Each
of the T iterations requires n kernel evaluations, so the
product of the bounds on T and n bounds the number
of kernel evaluations (we may express Equation D.2 in
terms of $\mathcal{L}(u)$ and $\|u\|$ instead of $\hat{\mathcal{L}}(w^*)$ and $\|w^*\|$,
since $\hat{\mathcal{L}}(w^*) \le \hat{\mathcal{L}}(u) \le \mathcal{L}(u) + \epsilon$ and $\|w^*\| \le \|u\|$).

Because each iteration will add at most one new ele- 
ment to the support set, the size of the support set is 
bounded by the number of iterations, T. 
Figure 4. Illustration of how one finds the "water level" in a problem with an unregularized bias. The two curves represent
the heights of two basins of heights $c_i - b$ and $c_i + b$, corresponding to the negative and positive examples, respectively,
with the bias b determining the relative heights of the basins. Optimizing over $\xi$ and p corresponds to filling these two
basins with water of total volume $n\nu$ and common water level $\gamma$, while optimizing b corresponds to ensuring that water
covers the same number of indices in each basin.



This discussion has proved that we can achieve suboptimality
$2\epsilon$ with probability $1 - 2\delta$ with the given #K
and #S. Because scaling $\epsilon$ and $\delta$ by 1/2 only changes
the resulting bounds by constant factors, these results
apply equally well for suboptimality $\epsilon$ with probability
$1 - \delta$. □

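D.2. Pegasos / SGD on $\hat{\mathcal{L}}$

If w is the result of a call to the Pegasos algorithm
(Shalev-Shwartz et al., 2011) without a projection
step, then the analysis of Kakade & Tewari (2009,
Corollary 7) permits us to bound the suboptimality
relative to an arbitrary reference classifier u, with
probability $1 - \delta$, as:

\[ \left( \frac{\lambda}{2} \|w\|^2 + \hat{\mathcal{L}}(w) \right) - \left( \frac{\lambda}{2} \|u\|^2 + \hat{\mathcal{L}}(u) \right) \le \frac{84 r^2 \log T}{\lambda T} \log \frac{1}{\delta} \tag{D.3} \]

Equation D.3 implies that, if one performs the following
number of iterations, then the resulting solution
will be $\epsilon/2$-suboptimal in the regularized objective,
with probability $1 - \delta$:

\[ T = \tilde{O}\left( \frac{1}{\epsilon} \cdot \frac{r^2}{\lambda} \log \frac{1}{\delta} \right) \]

Here, $\epsilon$ bounds the suboptimality not of the empirical
hinge loss, but rather of the regularized objective
(hinge loss + regularization). Although the
dependence on $1/\epsilon$ is linear, accounting for the $\lambda$
dependence results in a bound which is not nearly as
good as the above appears. To see this, we'll follow
Shalev-Shwartz & Srebro (2008) by decomposing the
suboptimality in the empirical hinge loss as:

\[ \hat{\mathcal{L}}(w) - \hat{\mathcal{L}}(u) \le \frac{\epsilon}{2} - \frac{\lambda}{2} \|w\|^2 + \frac{\lambda}{2} \|u\|^2 \le \frac{\epsilon}{2} + \frac{\lambda}{2} \|u\|^2 \]

In order to have both terms bounded by $\epsilon/2$, we choose
$\lambda = \epsilon / \|u\|^2$, which reduces the RHS of the above to $\epsilon$.
Continuing to use this choice of $\lambda$, we next decompose
the squared norm of w as:

\[ \frac{\lambda}{2} \|w\|^2 \le \frac{\epsilon}{2} - \hat{\mathcal{L}}(w) + \hat{\mathcal{L}}(u) + \frac{\lambda}{2} \|u\|^2 \le \frac{\epsilon}{2} + \hat{\mathcal{L}}(u) + \frac{\lambda}{2} \|u\|^2 \]
\[ \|w\|^2 \le 2 \left( \frac{\hat{\mathcal{L}}(u) + \epsilon}{\epsilon} \right) \|u\|^2 \]

Hence, we will have that:

\[ \|w\|^2 \le 2 \left( \frac{\hat{\mathcal{L}}(u) + \epsilon}{\epsilon} \right) \|u\|^2, \qquad \hat{\mathcal{L}}(w) - \hat{\mathcal{L}}(u) \le \epsilon \tag{D.4} \]

with probability $1 - \delta$, after performing the following
number of iterations:

\[ T = \tilde{O}\left( \frac{r^2 \|u\|^2}{\epsilon^2} \log \frac{1}{\delta} \right) \tag{D.5} \]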
There are two ways in which we will use this bound on
T to find a bound on the number of kernel evaluations
required to achieve some desired generalization error.
The easiest is to note that the bound of Equation D.5
exceeds that of Lemma D.1, so that if we take $T = n$,
then with high probability, we'll achieve generalization
error $2\epsilon$ after $Tn = T^2$ kernel evaluations:

\[ \#K = \tilde{O}\left( \frac{r^4 \|u\|^4}{\epsilon^4} \log^2 \frac{1}{\delta} \right) \tag{D.6} \]

Because we take the number of iterations to be pre- 
cisely the same as the number of training examples, 
this is essentially the online stochastic setting. 

Alternatively, we may combine our bound on T with 
Lemma D.l. This yields the following bound on the 
generalization error of Pegasos in the data-laden batch 
setting. 

Theorem D.3. Let u be an arbitrary linear classifier
in the RKHS, let $\epsilon > 0$ be given, and suppose that
$K(x, x) \le r^2$ with probability 1. There exist values of
the training size n, iteration count T and parameter
$\lambda$ such that kernelized Pegasos finds a solution
$w = \sum_{i=1}^n \alpha_i y_i \Phi(x_i)$ satisfying:

\[ \mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon \]

where $\mathcal{L}_{0/1}$ and $\mathcal{L}$ are the expected 0/1 and hinge
losses, respectively, after performing the following
number of kernel evaluations:

with the size of the support set of w (the number of
nonzero elements in $\alpha$) satisfying:

the above statements holding with probability $1 - \delta$.

Proof. Same proof technique as in Theorem D.2. □

Because of the extra term in the bound on $\|w\|$ in
Equation D.4, Theorem D.3 gives a bound which is
worse by a factor of $(\mathcal{L}(u) + \epsilon)/\epsilon$ than what we might
have hoped to recover. When $\epsilon \ll \mathcal{L}(u)$, this extra
factor results in the bound going as $1/\epsilon^5$ rather than
$1/\epsilon^4$. We need to use Equation D.6 to get a bound
in this case.

Although this bound on the generalization performance
of Pegasos is not quite what we expected, for
the related algorithm which performs SGD on the following
objective:

\[ \min_w \frac{1}{n} \sum_{i=1}^n \ell\left( y_i \langle w, \Phi(x_i)\rangle \right) \]

subject to: \\w\\'^ < 



the same proof technique yields the desired bound (i.e.
without the extra $(\mathcal{L}(u) + \epsilon)/\epsilon$ factor). This is the
origin of the "SGD on $\hat{\mathcal{L}}$" row in Table 1.

D.3. Perceptron 

Analysis of the venerable online Perceptron algorithm 
is typically presented as a bound on the number of mis- 
takes made by the algorithm in terms of the hinge loss 
of the best classifier — this is precisely the form which 
we consider in this document, despite the fact that the 
online Perceptron does not optimize any scalarization 
of the bi-criterion SVM objective of Problem 2.1. In- 
terestingly, the performance of the Perceptron matches 
that of the SBP, as is shown in the following theorem: 

Theorem D.4. Let u be an arbitrary linear classifier
in the RKHS, let $\epsilon > 0$ be given, and suppose that
$K(x, x) \le r^2$ with probability 1. There exists a value
of the training size n such that when the Perceptron
algorithm is run for a single "pass" over the dataset,
the result is a solution $w = \sum_{i=1}^n \alpha_i y_i \Phi(x_i)$ satisfying:

\[ \mathcal{L}_{0/1}(w) \le \mathcal{L}(u) + \epsilon \]

where $\mathcal{L}_{0/1}$ and $\mathcal{L}$ are the expected 0/1 and hinge
losses, respectively, after performing the following
number of kernel evaluations:

#^.o('(£M±i)'llfc£l) 

with the size of the support set of w (the number of
nonzero elements in $\alpha$) satisfying:

the above statements holding with probability $1 - \delta$.

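Proof. If we run the online Perceptron algorithm for
a single pass over the dataset, then Corollary 5 of
(Shalev-Shwartz, 2007) gives the following mistake
bound, for $\mathcal{M}$ being the set of iterations on which a
mistake is made:

\[ |\mathcal{M}| \le \sum_{i \in \mathcal{M}} \ell\left( y_i \langle u, \Phi(x_i)\rangle \right) + r \|u\| \sqrt{ \sum_{i \in \mathcal{M}} \ell\left( y_i \langle u, \Phi(x_i)\rangle \right) } + r^2 \|u\|^2 \]
\[ \sum_{i=1}^n \ell_{0/1}\left( y_i \langle w_i, \Phi(x_i)\rangle \right) \le \sum_{i=1}^n \ell\left( y_i \langle u, \Phi(x_i)\rangle \right) + r \|u\| \sqrt{ \sum_{i=1}^n \ell\left( y_i \langle u, \Phi(x_i)\rangle \right) } + r^2 \|u\|^2 \tag{D.7} \]

Here, $\ell$ is the hinge loss and $\ell_{0/1}$ is the 0/1 loss. Dividing
through by n:

\[ \frac{1}{n} \sum_{i=1}^n \ell_{0/1}\left( y_i \langle w_i, \Phi(x_i)\rangle \right) \le \frac{1}{n} \sum_{i=1}^n \ell\left( y_i \langle u, \Phi(x_i)\rangle \right) + \frac{r \|u\|}{\sqrt{n}} \sqrt{ \frac{1}{n} \sum_{i=1}^n \ell\left( y_i \langle u, \Phi(x_i)\rangle \right) } + \frac{r^2 \|u\|^2}{n} \]

If we suppose that the $x_i, y_i$'s are i.i.d., and that $w \sim \mathrm{Unif}(w_1, \dots, w_n)$
(this is a "sampling" online-to-batch
conversion), then:

\[ \mathbb{E}\left[ \mathcal{L}_{0/1}(w) \right] \le \mathcal{L}(u) + \frac{r \|u\|}{\sqrt{n}} \sqrt{\mathcal{L}(u)} + \frac{r^2 \|u\|^2}{n} \]

Hence, the following will be satisfied:

\[ \mathbb{E}\left[ \mathcal{L}_{0/1}(w) \right] \le \mathcal{L}(u) + \epsilon \tag{D.8} \]

when:

\[ n = O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right) \frac{r^2 \|u\|^2}{\epsilon} \right) \]

The expectation is taken over the random sampling
of w. The number of kernel evaluations performed by
the ith iteration of the Perceptron will be equal to
the number of mistakes made before iteration i. This
quantity is upper bounded by the total number of mistakes
made over n iterations, which is given by the
mistake bound of Equation D.7:

\[ |\mathcal{M}| \le n \hat{\mathcal{L}}(u) + r \|u\| \sqrt{n \hat{\mathcal{L}}(u)} + r^2 \|u\|^2 \le O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right)^2 r^2 \|u\|^2 \right) \]

The number of mistakes $|\mathcal{M}|$ is necessarily equal to
the size of the support set of the resulting classifier.
Substituting this bound into the number of iterations:

\[ \#K = n |\mathcal{M}| \le O\left( \left( \frac{\mathcal{L}(u) + \epsilon}{\epsilon} \right)^3 \frac{r^4 \|u\|^4}{\epsilon} \right) \]

This holds in expectation, but we can turn this into a
high-probability result using Markov's inequality, resulting
in a $\delta$-dependence of $1/\delta$. □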

Although this result has a $\delta$-dependence of $1/\delta$, this is
merely a relic of the simple online-to-batch conversion
which we use in the analysis. Using a more complex
algorithm (e.g. Cesa-Bianchi et al. (2001)) would likely
improve this term to $\log \frac{1}{\delta}$.
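
The accounting used in the proof (kernel evaluations per iteration equal to the number of mistakes made so far, and support size equal to the number of mistakes) can be seen in the following sketch of a single pass of the kernel Perceptron. The toy data, the linear kernel and the function names are illustrative assumptions, and the "sampling" online-to-batch conversion is not implemented here.

```python
def linear_kernel(a, b):
    return sum(p * q for p, q in zip(a, b))

def kernel_perceptron_pass(xs, ys, kernel):
    """Single pass of the online kernel Perceptron, counting kernel evaluations."""
    support = []       # indices of the iterations on which a mistake was made
    kernel_evals = 0
    for t, (x, y) in enumerate(zip(xs, ys)):
        score = 0.0
        for i in support:  # one kernel evaluation per earlier mistake
            score += ys[i] * kernel(xs[i], x)
            kernel_evals += 1
        if y * score <= 0:  # mistake: add the example to the support set
            support.append(t)
    return support, kernel_evals

# Hypothetical data separated with margin 1 by u = (1, 0), so the hinge loss of u is 0.
xs = [(2.0, 1.0), (1.5, -0.5), (1.0, 0.5), (-1.0, 0.3), (-2.0, -1.0), (-1.5, 0.2)]
ys = [1, 1, 1, -1, -1, -1]
support, kernel_evals = kernel_perceptron_pass(xs, ys, linear_kernel)

r_squared = max(linear_kernel(x, x) for x in xs)  # bound on K(x, x)
u_norm_squared = 1.0                              # squared norm of u = (1, 0)
print(len(support), kernel_evals, r_squared * u_norm_squared)
```

On this separable example the hinge-loss terms of the mistake bound vanish, so the number of mistakes is at most $r^2 \|u\|^2$, and the total number of kernel evaluations is at most n times the number of mistakes.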

E. Convergence rates of dual 
optimization methods 

In this section we discuss existing analyses of dual
optimization methods. We first underscore possible
gaps between dual sub-optimality and primal sub-optimality.
To relate existing analyses of the dual
sub-optimality in the literature to our setting, we must
therefore connect dual sub-optimality to
primal sub-optimality. We do so using a result due to
Scovel et al. (2008), and based on this result, we derive
convergence rates on the primal sub-optimality.

Throughout this section, the "SVM problem" is taken
to be the regularized objective of Problem 2.2. We
denote the primal objective by:

\[ P(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^n \ell\left( y_i \langle w, x_i\rangle \right) \]

The dual objective can be written as:

\[ D(\alpha) = \lambda \left( \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j Q_{ij} \right) \]

where $Q_{ij} = y_i y_j \langle x_i, x_j\rangle$, and the dual constraints are
$\alpha \in [0, 1/(\lambda n)]^n$. Finally, by strong duality we have:

\[ P^* = \min_w P(w) = \max_{\alpha \in [0, 1/(\lambda n)]^n} D(\alpha) = D^* \]

E.l. Dual gap vs. Primal gap 

Several authors have analyzed the convergence rate of dual
optimization algorithms. For example, Hsieh et al.
(2008) and Collins et al. (2008) analyzed the convergence
rate of SDCA, and Chen et al. (2006) analyzed the convergence
rate of SMO-type dual decomposition methods.
In both cases, the quantity analyzed is the number of iterations required
for the dual sub-optimality to be at most $\epsilon$.
This is not satisfactory since our goal is to understand
how many iterations are required to achieve
a primal sub-optimality of at most $\epsilon$. Indeed, the following
lemma shows that a guarantee on a small dual
sub-optimality might yield only a trivial guarantee on the
primal sub-optimality.

Lemma E.1. For every $\epsilon > 0$, there exists a SVM
problem with a dual solution $\alpha$ that is $\epsilon$-accurate, while
the corresponding primal solution, $w = \sum_i \alpha_i y_i x_i$, is
at least $(1 - \epsilon)$ sub-optimal. Furthermore, the distribution
is such that there exists u with $\|u\| = 1$ and
$\mathcal{L}(u) = 0$, while $\mathcal{L}(w) = 1$ and $\mathcal{L}_{0/1}(w) = 1/2$.

Proof. Fix some u with $\|u\| = 1$ and choose any distribution
such that $\mathcal{L}(u) = 0$. Take a sample of size
n from this distribution. A reasonable choice for the
regularization parameter of SVM in this case is to set
$\lambda = 2\epsilon$. We have: $P^* \le P(u) = \frac{\lambda}{2} \|u\|^2 = \epsilon$. Now,
for $\alpha = 0$ we have $D^* - D(\alpha) = P^* - 0 \le \epsilon$. Therefore,
the dual sub-optimality of $\alpha = 0$ is at most $\epsilon$.
On the other hand, the corresponding primal solution
is $w = 0$, which gives $P(0) - P^* = 1 - P^* \ge 1 - \epsilon$.
Furthermore, $\mathcal{L}(0) = 1$ and $\mathcal{L}_{0/1}(0) = 1/2$, assuming
that we break ties at random. □
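
The construction in the proof can be reproduced numerically. The sketch below assumes the hinge loss, a linear kernel, a primal of the form $\frac{\lambda}{2}\|w\|^2$ plus the average hinge loss, and a dual of the form $\lambda(\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j Q_{ij})$; the sample points and function names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def primal(w, xs, ys, lam):
    """Regularized objective: (lam / 2) ||w||^2 + average hinge loss."""
    avg_hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in zip(xs, ys)) / len(xs)
    return 0.5 * lam * dot(w, w) + avg_hinge

def dual(alpha, xs, ys, lam):
    """Dual objective in the scaling assumed above, with Q_ij = y_i y_j <x_i, x_j>."""
    linear = sum(alpha)
    quadratic = 0.0
    for a_i, x_i, y_i in zip(alpha, xs, ys):
        for a_j, x_j, y_j in zip(alpha, xs, ys):
            quadratic += a_i * a_j * y_i * y_j * dot(x_i, x_j)
    return lam * (linear - 0.5 * quadratic)

# Hypothetical sample on which u = (1, 0) has unit norm and zero hinge loss.
xs = [(2.0, 1.0), (1.5, -0.5), (1.0, 0.5), (-1.0, 0.3), (-2.0, -1.0), (-1.5, 0.2)]
ys = [1, 1, 1, -1, -1, -1]
u = (1.0, 0.0)
epsilon = 0.01
lam = 2.0 * epsilon

p_u = primal(u, xs, ys, lam)                            # upper bound on P*
p_zero = primal((0.0, 0.0), xs, ys, lam)                # primal value at w = 0
d_zero = dual([0.0] * len(xs), xs, ys, lam)             # dual value at alpha = 0
print(p_u, p_zero, d_zero)
```

Since $D^* = P^* \le P(u)$, the dual gap of $\alpha = 0$ is at most the printed value of `p_u`, while the primal gap of the corresponding $w = 0$ is at least `p_zero - p_u`.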

In an attempt to connect dual and primal
sub-optimality, Scovel et al. (2008) derived approximate
duality theorems. This was used by Hush et al.
(2006, Theorem 2) to show the following:



Theorem E.2. (Hush et al., 2006, Theorem 2) To 
achieve Ep sub-optimality in the primal, it suffices to 

A 

require a sub-optimality in the dual of e < ^if- 

There is no contradiction to Lemma E.l above since 
in the proof of the lemma we set A = 2e, which yields 
ep > 1. 

E.2. Analyzing the primal sub-optimality of 
dual methods 

Chen et al. (2006) derived the linear convergence of
SMO-type algorithms. However, the analysis takes the
following form:

There are $c < 1$ and $\bar{k}$, such that for all $k \ge \bar{k}$
it holds that $D(\alpha^{(k+1)}) - D^* \le c\,(D(\alpha^{(k)}) - D^*)$.

In the above, $\alpha^{(k)}$ is the dual solution after performing
k iterations, and $D^*$ is the optimal dual value.

This type of analysis is not satisfactory since $\bar{k}$ can
be extremely large and c can be extremely close to 1.
As an extreme example, suppose that $\bar{k}$ is exponential
in n. Then, in any practical implementation of
the method, we will never reach the regime in which
the linear convergence result holds. As a less extreme
example, suppose that $\bar{k} \ge n^2$. It follows that we
might need to calculate the entire Gram matrix before
the linear convergence analysis kicks in. To make
more satisfactory statements, we therefore seek convergence
analyses which demonstrate good performance
not only asymptotically, but also for reasonably small
values of k.

Hush et al. (2006) combined explicit convergence rate
analysis of the dual sub-optimality of certain decomposition
methods with Theorem E.2. The end result
is an algorithm with a bound of $O(n)$ on the number
of dual iterations, and a total number of kernel evaluations
at training time of $O(n^2)$. It also follows that
the number of support vectors can be of order n.

Hsieh et al. (2008) analyzed the convergence rate of 
SDCA and derived a bound on the duality sub- 
optimality after performing T iterations. Translating 
their results to our notation and ignoring low order 
terms we obtain: 

eD<-^{{\ma*r + P*) . 
I + n 

where $\alpha^*$ is such that $w^* = \sum_i \alpha_i^* y_i x_i$. Combining this
with Theorem E.2 yields that the number of iterations,
according to this analysis, should be at least

So, even if we set $\epsilon_P = P^*$ we still need



) 



Each iteration of SDCA costs roughly the same as a single
iteration of Pegasos. However, Pegasos needs on the order
of $1/(\lambda \epsilon_P)$ iterations, while according to the analysis
above, SDCA requires a factor of n more iterations. We
suspect that this analysis is not tight.
Algorithm 3 Divide-and-conquer algorithm for finding the "water level" γ and bias b from an array of labels y, array of
responses C and total volume nν, for a problem with an unregularized bias. The partition function is as in Algorithm 2.

find_gamma_and_bias(y : {±1}^n, C : R^n, nν : R)

    C⁺ := {C[i] : y[i] = +1}; n⁺ := |C⁺|; lower⁺ := 1; upper⁺ := n⁺; lower_max⁺ := −∞; lower_sum⁺ := 0;
    C⁻ := {C[i] : y[i] = −1}; n⁻ := |C⁻|; lower⁻ := 1; upper⁻ := n⁻; lower_max⁻ := −∞; lower_sum⁻ := 0;
    middle⁺ := partition(C⁺[lower⁺ : upper⁺]);
    middle⁻ := partition(C⁻[lower⁻ : upper⁻]);
    middle_max⁺ := max(C⁺[lower⁺ : (middle⁺ − 1)]); middle_sum⁺ := Σ C⁺[lower⁺ : (middle⁺ − 1)];
    middle_max⁻ := max(C⁻[lower⁻ : (middle⁻ − 1)]); middle_sum⁻ := Σ C⁻[lower⁻ : (middle⁻ − 1)];
    while (lower⁺ < upper⁺) or (lower⁻ < upper⁻)
        direction⁺ := 0; direction⁻ := 0;
        if middle⁺ < lower⁻ then direction⁺ := 1;
        else if middle⁺ > upper⁻ then direction⁺ := −1;
        if middle⁻ < lower⁺ then direction⁻ := 1;
        else if middle⁻ > upper⁺ then direction⁻ := −1;
        if direction⁺ = direction⁻ = 0 then
            volume⁺ := middle_max⁺ · (middle⁺ − 1) − middle_sum⁺;
            volume⁻ := middle_max⁻ · (middle⁻ − 1) − middle_sum⁻;
            if volume⁺ + volume⁻ > nν then
                if middle⁺ > middle⁻ then direction⁺ := −1;
                else if middle⁻ > middle⁺ then direction⁻ := −1;
                else if upper⁺ − lower⁺ > upper⁻ − lower⁻ then direction⁺ := −1;
                else direction⁻ := −1;
            else
                if middle⁺ < middle⁻ then direction⁺ := 1;
                else if middle⁻ < middle⁺ then direction⁻ := 1;
                else if upper⁺ − lower⁺ > upper⁻ − lower⁻ then direction⁺ := 1;
                else direction⁻ := 1;
        if direction⁺ ≠ 0 then
            if direction⁺ > 0 then upper⁺ := middle⁺ − 1;
            else lower⁺ := middle⁺; lower_max⁺ := middle_max⁺; lower_sum⁺ := middle_sum⁺;
            middle⁺ := partition(C⁺[lower⁺ : upper⁺]);
            middle_max⁺ := max(lower_max⁺, C⁺[lower⁺ : (middle⁺ − 1)]);
            middle_sum⁺ := lower_sum⁺ + Σ C⁺[lower⁺ : (middle⁺ − 1)];
        if direction⁻ ≠ 0 then
            if direction⁻ > 0 then upper⁻ := middle⁻ − 1;
            else lower⁻ := middle⁻; lower_max⁻ := middle_max⁻; lower_sum⁻ := middle_sum⁻;
            middle⁻ := partition(C⁻[lower⁻ : upper⁻]);
            middle_max⁻ := max(lower_max⁻, C⁻[lower⁻ : (middle⁻ − 1)]);
            middle_sum⁻ := lower_sum⁻ + Σ C⁻[lower⁻ : (middle⁻ − 1)];
    // at this point lower⁺ = lower⁻ = upper⁺ = upper⁻
    Δγ := (nν + lower_sum⁺ + lower_sum⁻) / (lower⁺ − 1) − lower_max⁺ − lower_max⁻;
    if lower⁺ < n⁺ then Δγ⁺ := min(Δγ, C⁺[lower⁺] − lower_max⁺) else Δγ⁺ := Δγ;
    if lower⁻ < n⁻ then Δγ⁻ := min(Δγ, C⁻[lower⁻] − lower_max⁻) else Δγ⁻ := Δγ;
    γ⁺ := lower_max⁺ + 0.5 · (Δγ + Δγ⁺ − Δγ⁻); γ⁻ := lower_max⁻ + 0.5 · (Δγ − Δγ⁺ + Δγ⁻);
    γ := 0.5 · (γ⁺ + γ⁻); b := 0.5 · (γ⁻ − γ⁺);
    return (γ, b);



